Text Segmentation into Paragraphs Based on Local Text Cohesion
Authors
Abstract
Automatic text segmentation comprises two distinct problems: thematic segmentation into relatively large, topically self-contained sections, and splitting into paragraphs, i.e., lower-level lexico-grammatical segmentation. In this paper we consider the latter problem. We propose a method for splitting text into paragraphs based on a measure of text cohesion. Specifically, we propose a method for the quantitative evaluation of text cohesion based on a large linguistic resource, a collocation network. At each step, our algorithm compares word occurrences in the text against a large database of collocations and semantic links between words in the given natural language. The procedure consists of evaluating the cohesion function, smoothing and normalizing it, and comparing the result with a specially constructed threshold.
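The pipeline named in the abstract (evaluate cohesion, smooth, normalize, compare with a threshold) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the toy link set `TOY_LINKS` stands in for the collocation network, cohesion is scored as the fraction of linked or identical word pairs between adjacent sentences, smoothing is a 1-2-1 weighted average, normalization is min-max scaling, and the threshold is the mean of the curve minus `k` standard deviations.

```python
from statistics import mean, pstdev

# Hypothetical stand-in for a collocation network: unordered linked word pairs.
TOY_LINKS = {
    frozenset(("rain", "cloud")),
    frozenset(("cloud", "storm")),
    frozenset(("bank", "loan")),
    frozenset(("loan", "interest")),
}

# Each sentence is a list of already tokenized content words (assumed input form).
SENTENCES = [
    ["rain", "cloud"], ["cloud", "storm"], ["storm", "rain"],
    ["bank", "loan"], ["loan", "interest"], ["interest", "bank"],
]


def cohesion(left, right, links):
    """Fraction of cross-sentence word pairs that are identical or linked."""
    pairs = [(a, b) for a in left for b in right]
    if not pairs:
        return 0.0
    hits = sum(1 for a, b in pairs if a == b or frozenset((a, b)) in links)
    return hits / len(pairs)


def smooth(values):
    """Weighted moving average with weights 1-2-1, renormalized at the edges."""
    out = []
    for i, v in enumerate(values):
        total, weight = 2.0 * v, 2.0
        if i > 0:
            total += values[i - 1]
            weight += 1.0
        if i + 1 < len(values):
            total += values[i + 1]
            weight += 1.0
        out.append(total / weight)
    return out


def normalize(values):
    """Min-max scale to [0, 1]; a constant curve maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def paragraph_breaks(sentences, links, k=0.5):
    """Return indices i where a paragraph break is proposed before sentence i."""
    if len(sentences) < 2:
        return []
    raw = [cohesion(sentences[i], sentences[i + 1], links)
           for i in range(len(sentences) - 1)]
    curve = normalize(smooth(raw))
    threshold = mean(curve) - k * pstdev(curve)  # assumed threshold rule
    return [i + 1 for i, c in enumerate(curve) if c < threshold]


if __name__ == "__main__":
    print(paragraph_breaks(SENTENCES, TOY_LINKS))
```

In this toy input the first three sentences share weather vocabulary and the last three share finance vocabulary, so the cohesion curve dips between them. A real system would replace `TOY_LINKS` with lookups in a large collocation database and would likely need lemmatization before matching.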
Similar resources
Segmentation and segment cohesion: On the thematic organization of the text*
Most linguists concerned with cohesion have focused on the linear relations between sentences. This study is an attempt to extend the notion of cohesion beyond the sentence level, by viewing it as a requirement of the text for connectedness between segments larger than a sentence, such as paragraphs or whole chapters. Of the various concatenation devices listed in Danes (1974), the most element...
A Modified Character Segmentation Algorithm for Farsi Printed Text Using Upper Contour Labelling
In this paper, a modified segmentation algorithm for printed Farsi words is presented. This algorithm is based on a previous work by Azmi that uses the conditional labeling of the upper contour to find the segmentation points. The main objective is to improve the segmentation results for low quality prints. To achieve this, various modifications on local baseline detection, contour labeling an...
Optimal Multi-Paragraph Text Segmentation by Dynamic Programming
There exist several methods of calculating a similarity curve, or a sequence of similarity values, representing the lexical cohesion of successive text constituents, e.g., paragraphs. Methods for deciding the locations of fragment boundaries are, however, scarce. We propose a fragmentation method based on dynamic programming. The method is theoretically sound and guaranteed to provide an optima...
Text Line Aggregation
We present a new approach to text line aggregation that can work as both a line formation stage for a myriad of text segmentation methods (over all orientations) and as an extra level of filtering to remove false text candidates. The proposed method is centred on the processing of candidate text components based on local and global measures. We use orientation histograms to build an understandi...